Our Question

With our data, we wanted to examine the relationship between the price of Airbnb properties and the neighborhoods they are listed in. More specifically, we wanted to see if our model could predict a neighborhood based on price and a few other variables we chose later. We thought this would be an interesting dataset to explore given the growth of the tourism industry as vaccination rates rise and cities start to open up. Our exploration and modeling below set out to answer this question.

We also wanted a way to predict the rating of Airbnb properties. We thought this would be a good addition to our project, since a good deal on an Airbnb does not matter much if the experience is bad. Ideally, we wanted a way to find a good deal that also has a high rating.

Our dataset was obtained from Kaggle, but the data itself was sourced from Inside Airbnb.

Looking ahead to our modeling, price was used 100% of the time (it had the highest variable importance), which is fitting since our main goal was to examine the relationship between price and neighborhood.

Cleaning our Data

The first thing we did was read in our data.

boston <- read.csv("Boston_Airbnb_copy.csv")

Next we subset the columns we thought would make for interesting predictions, renamed columns so that they would be easier to reference, and converted ‘price’ into a numeric variable.

boston <- boston%>%
  select(c("name","neighbourhood_cleansed","latitude", "longitude", "room_type", 
           "accommodates", "price", "review_scores_rating", 
           "host_is_superhost", "property_type"))

boston <-  boston%>%
  rename(superhost = host_is_superhost, neighborhood = neighbourhood_cleansed)

## strip the dollar sign (and any thousands commas, which would
## otherwise become NA) before converting to numeric
boston$price = as.numeric(gsub("[$,]", "", boston$price))

Data Exploration

We thought grouping by neighborhood would be an easy way to do some basic exploration. Looking at the neighborhood counts, we saw that the most popular neighborhoods are Jamaica Plain, South End, Back Bay, Fenway, Dorchester, and Allston.

table(boston$neighborhood)
## 
##                 Allston                Back Bay             Bay Village 
##                     260                     302                      24 
##             Beacon Hill                Brighton             Charlestown 
##                     194                     185                     111 
##               Chinatown              Dorchester                Downtown 
##                      71                     269                     172 
##             East Boston                  Fenway               Hyde Park 
##                     150                     290                      31 
##           Jamaica Plain        Leather District   Longwood Medical Area 
##                     343                       5                       9 
##                Mattapan            Mission Hill               North End 
##                      24                     124                     143 
##              Roslindale                 Roxbury            South Boston 
##                      56                     144                     174 
## South Boston Waterfront               South End                West End 
##                      83                     326                      49 
##            West Roxbury 
##                      46

Median Values Grouped by Neighborhood

## could be interesting to compare to means
bost_med_table <- boston%>%
  group_by(neighborhood)%>%
  summarise(medianReview = median(review_scores_rating, na.rm=T),
            medianPrice = median(price, na.rm=T))

bost_med_table
## # A tibble: 25 × 3
##    neighborhood medianReview medianPrice
##    <chr>               <dbl>       <dbl>
##  1 Allston              94           85 
##  2 Back Bay             93          209 
##  3 Bay Village          93.5        206.
##  4 Beacon Hill          95          195 
##  5 Brighton             94           90 
##  6 Charlestown          96          178.
##  7 Chinatown            95          219 
##  8 Dorchester           93           72 
##  9 Downtown             94          225 
## 10 East Boston          92           99 
## # … with 15 more rows
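The comment in the chunk above suggests comparing medians to means. A quick sketch of how that comparison might look, using a small hypothetical `listings` data frame in place of our real `boston` data: a large mean-minus-median gap flags neighborhoods whose prices are skewed upward by a few expensive listings.

```r
library(dplyr)

# Hypothetical stand-in for the real boston data frame
listings <- data.frame(
  neighborhood = c("A", "A", "A", "B", "B", "B"),
  price = c(80, 90, 500, 100, 110, 120)
)

# skew_gap is large where one pricey listing pulls the mean up
listings %>%
  group_by(neighborhood) %>%
  summarise(medianPrice = median(price, na.rm = TRUE),
            meanPrice   = mean(price, na.rm = TRUE),
            skew_gap    = meanPrice - medianPrice)
```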

Median Prices by Neighborhood

plot <- ggplot(data=bost_med_table, aes(x=neighborhood, y=medianPrice, fill=medianPrice))+
  scale_fill_gradient(low = "dark red", high = "cornflowerblue")+
  geom_bar(stat='identity')+
  theme(axis.text.x = element_text(angle=90))+
  labs(x="Neighborhood", y="Median Price of Airbnb per Night", title="Distribution of Airbnb prices per night over Neighborhoods in Boston")


plot

Room Type by Neighborhood

boston%>%
  group_by(neighborhood)%>%
  select(room_type)%>%
  table()
## Adding missing grouping variables: `neighborhood`
##                          room_type
## neighborhood              Entire home/apt Private room Shared room
##   Allston                              98          156           6
##   Back Bay                            263           36           3
##   Bay Village                          20            4           0
##   Beacon Hill                         155           36           3
##   Brighton                             75          103           7
##   Charlestown                          68           42           1
##   Chinatown                            62            8           1
##   Dorchester                           66          195           8
##   Downtown                            144           24           4
##   East Boston                          70           77           3
##   Fenway                              208           73           9
##   Hyde Park                             6           24           1
##   Jamaica Plain                       157          181           5
##   Leather District                      3            2           0
##   Longwood Medical Area                 4            4           1
##   Mattapan                              3           21           0
##   Mission Hill                         48           68           8
##   North End                           119           21           3
##   Roslindale                           19           37           0
##   Roxbury                              58           81           5
##   South Boston                        102           69           3
##   South Boston Waterfront              71           12           0
##   South End                           250           69           7
##   West End                             43            6           0
##   West Roxbury                         15           29           2

Top 100 Most Expensive, by Neighborhood

We see here that the top 100 most expensive Airbnbs are fairly evenly distributed over the neighborhoods. Back Bay has the highest count with 17, but that does not seem high enough to be an outlier, and it makes sense given that Back Bay was one of the most popular neighborhoods in the first place.

boston_top100 <- boston%>%
  arrange(desc(price))

head(boston_top100, 100)%>%
  select(neighborhood)%>%
  table()
## .
##                 Allston                Back Bay             Bay Village 
##                       3                      17                       5 
##             Beacon Hill                Brighton             Charlestown 
##                       9                       3                       4 
##                Downtown                  Fenway           Jamaica Plain 
##                       3                       6                      11 
##            Mission Hill               North End                 Roxbury 
##                       1                       2                       7 
##            South Boston South Boston Waterfront               South End 
##                       8                       7                      13 
##                West End 
##                       1

Clustering

First we picked the variables we thought would do the best job of predicting neighborhoods. We chose price, rating, and room type, since those showed the greatest variation by neighborhood in our exploration.

clust_boston <- boston[, c("price", "review_scores_rating", "room_type")]

Formatting room_type to be usable in clustering

table(clust_boston$room_type)
## 
## Entire home/apt    Private room     Shared room 
##            2127            1378              80
## collapse each room type to a placeholder level ("v1"-"v3"), then
## strip the "v" prefix to get numeric codes 1-3
clust_boston$room_type <- fct_collapse(clust_boston$room_type,
                                  v1 = "Entire home/apt",
                                  v2 = "Private room",
                                  v3 = "Shared room")

clust_boston$room_type = as.numeric(gsub("v", "", clust_boston$room_type))

There are a few NAs in the price and rating columns, so we replaced them with the median.

clust_boston$review_scores_rating[is.na(clust_boston$review_scores_rating)] <- median(clust_boston$review_scores_rating, na.rm=T)
clust_boston$price[is.na(clust_boston$price)] <- median(clust_boston$price, na.rm=T)

sum(is.na(clust_boston$review_scores_rating))
## [1] 0
sum(is.na(clust_boston$price))
## [1] 0

And then we min-max normalized our numeric variables onto the [0, 1] range.

normalize <- function(x){
 (x - min(x)) / (max(x) - min(x))
}

clust_boston[1:2] <- lapply(clust_boston[1:2], normalize)
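As a quick sanity check, the helper above maps any numeric vector onto [0, 1], with the minimum at 0 and the maximum at 1 (shown here on a toy vector):

```r
# same min-max helper as above
normalize <- function(x){
  (x - min(x)) / (max(x) - min(x))
}

normalize(c(10, 20, 30))  # 0.0 0.5 1.0
```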

Then we collapsed the neighborhoods into 2 groups: Central Boston and the Boston suburbs.

boston$neighborhood_groups <- fct_collapse(boston$neighborhood,
                                           Suburbs = c("Jamaica Plain", "Roslindale", 
                                                       "Dorchester","Roxbury", 
                                                       "West Roxbury", "Hyde Park", 
                                                       "Mattapan", "Brighton", "Allston"),
                                           Central_Boston = c("Bay Village", "Back Bay",
                                                              "Beacon Hill", "West End",
                                                              "North End", "Downtown",
                                                              "South End", "Chinatown",
                                                              "Leather District", "Fenway", 
                                                              "Mission Hill", "Longwood Medical Area", 
                                                              "South Boston", "South Boston Waterfront", 
                                                              "Charlestown", "East Boston"))

And we used the elbow method to figure out how many clusters to choose.

Based on this graph, it looks like 2 clusters will give us the best model without over-fitting. This also makes sense because we collapsed the neighborhoods into 2 groups.

explained_variance = function(data_in, k){
  set.seed(1)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp  
}

explained_var_boston = sapply(1:10, explained_variance, data_in = clust_boston)
explained_var_boston
##  [1] -7.690460e-15  8.696275e-01  8.696275e-01  9.369267e-01  8.906499e-01
##  [6]  8.988897e-01  8.991225e-01  9.664217e-01  9.768941e-01  9.801408e-01
elbow_boston = data.frame(k = 1:10, explained_var_boston)
ggplot(elbow_boston, 
       aes(x = k,  
           y = explained_var_boston)) + 
  geom_point(size = 4) +
  geom_line(size = 1) + 
  xlab('k') + 
  ylab('Inter-cluster Variance / Total Variance') + 
  theme_light()

So we ran k-means with 2 clusters.

set.seed(123)
kmeans_obj_boston = kmeans(clust_boston, centers = 2, 
                        algorithm = "Lloyd")

kmeans_obj_boston
## K-means clustering with 2 clusters of sizes 2127, 1458
## 
## Cluster means:
##        price review_scores_rating room_type
## 1 0.21318566            0.9089093   1.00000
## 2 0.08432192            0.8989626   2.05487
## 
## Clustering vector:
##    [1] 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2
##  [ output truncated: cluster assignments for all 3585 listings ]
## 
## Within cluster sum of squares by cluster:
## [1]  43.9276 102.5454
##  (between_SS / total_SS =  87.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
clusters_boston = as.factor(kmeans_obj_boston$cluster)
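Before fitting any supervised model, a quick cross-tabulation shows how well the raw cluster labels line up with the two neighborhood groups. The short vectors below are hypothetical stand-ins for `clusters_boston` and `boston$neighborhood_groups`:

```r
# Hypothetical stand-ins for clusters_boston and boston$neighborhood_groups
clusters <- factor(c(1, 1, 2, 2, 1, 2))
groups   <- factor(c("Central_Boston", "Central_Boston", "Suburbs",
                     "Suburbs", "Suburbs", "Central_Boston"))

# Cross-tabulate cluster assignment against the true group
table(Cluster = clusters, Group = groups)

# Raw agreement rate, reading cluster 1 as Central_Boston and 2 as Suburbs
mean((clusters == 1) == (groups == "Central_Boston"))  # 4 of 6 agree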

Visualizations

Cluster Scatterplot

neighborhood_clusters = as.factor(kmeans_obj_boston$cluster)

ggplot(boston, aes(x = price, 
                            y = review_scores_rating,
                            color = neighborhood_groups,
                            shape = neighborhood_clusters)) + 
  geom_point(size = 2) +
  ggtitle("Price vs Rating of Boston Airbnbs") +
  xlab("Price per Night") +
  ylab("Review Score (out of 100)") +
  scale_shape_manual(name = "Cluster", 
                     labels = c("Cluster 1", "Cluster 2"),
                     values = c("1", "2")) +
  theme_light()

Map of Airbnbs and Their Clusters

We thought a map where you could see the physical location of the properties would be the best way to visualize our model, so we appended the clusters to our dataset and installed the required packages.

boston$clusters <- neighborhood_clusters

Since this map is not interactive, we adjusted the center of the map to fit all of the properties and zoomed as far as we could without Airbnbs getting cut off. Here, of the 3585 observations, only 5 did not fit in the map.

map1 <- ggmap(get_googlemap(center = c(lon = -71.0759, lat = 42.319),
                    zoom = 12, scale = 2,
                    maptype ='terrain',
                    color = 'color'))+ 
  geom_point(aes(x = longitude, y = latitude,  colour = clusters), data = boston, size = 0.5) + 
  theme(legend.position="bottom")
map1
## Warning: Removed 5 rows containing missing values (geom_point).

Zoomed Map

We ran the map again, this time zooming in on central Boston to get a clearer look at it. Only 1747 properties are shown in this map, but we see here that our model is doing a pretty good job at predicting Airbnbs in Central Boston.

map_zoomed <- ggmap(get_googlemap(center = c(lon = -71.0759, lat = 42.35101),
                    zoom = 14, scale = 2,
                    maptype ='terrain',
                    color = 'color'))+ 
  geom_point(aes(x = longitude, y = latitude,  colour = clusters), data = boston, size = 0.5) + 
  theme(legend.position="bottom")
map_zoomed
## Warning: Removed 1838 rows containing missing values (geom_point).

Confusion Matrix

Next we wanted to get a confusion matrix based on clusters to check our accuracy.

We started by converting the neighborhood groups and cluster assignments to factors.

clust_boston$neighborhood_groups <- boston$neighborhood_groups
clust_boston$clusters <- neighborhood_clusters

clust_boston[,c(4,5)] <- lapply(clust_boston[,c(4,5)], as.factor)

And then we partitioned the data into train, tune, and test sets, and assigned our features and target.

train_index <- createDataPartition(clust_boston$neighborhood_groups,
                                           p = .7,
                                           list = FALSE,
                                           times = 1)
train <- clust_boston[train_index,]
tune_and_test <- clust_boston[-train_index, ]


tune_and_test_index <- createDataPartition(tune_and_test$neighborhood_groups,
                                           p = .5,
                                           list = FALSE,
                                           times = 1)

tune <- tune_and_test[tune_and_test_index, ]
test <- tune_and_test[-tune_and_test_index, ]

features <- as.data.frame(train[,-c(4)])
target <- train$neighborhood_groups

And finally we ran the model to get our confusion matrix, and checked the variable importance.

set.seed(123)
boston_dt <- train(x=features,
                    y=target,
                    method="rpart")
 
varImp(boston_dt)
## rpart variable importance
## 
##                      Overall
## price                 100.00
## room_type              57.27
## clusters               56.66
## review_scores_rating    0.00
dt_predict_1 = predict(boston_dt,tune,type= "raw")

confusionMatrix(as.factor(dt_predict_1), 
                as.factor(tune$neighborhood_groups), 
                dnn=c("Prediction", "Actual"), 
                mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##                 Actual
## Prediction       Suburbs Central_Boston
##   Suburbs            141             71
##   Central_Boston      63            263
##                                           
##                Accuracy : 0.7509          
##                  95% CI : (0.7121, 0.7869)
##     No Information Rate : 0.6208          
##     P-Value [Acc > NIR] : 1.015e-10       
##                                           
##                   Kappa : 0.475           
##                                           
##  Mcnemar's Test P-Value : 0.5454          
##                                           
##             Sensitivity : 0.6912          
##             Specificity : 0.7874          
##          Pos Pred Value : 0.6651          
##          Neg Pred Value : 0.8067          
##              Prevalence : 0.3792          
##          Detection Rate : 0.2621          
##    Detection Prevalence : 0.3941          
##       Balanced Accuracy : 0.7393          
##                                           
##        'Positive' Class : Suburbs         
## 

Conclusions

Our accuracy was about 75%, which isn't the best, but we think our model did a good job predicting the neighborhood group with the information it was given. If we could have included more of the information in our original dataset, such as amenities, parking availability, and transit information, our model would likely have been more accurate, and we could even have split the neighborhoods into more groups (maybe downtown, central, and suburbs). Unfortunately these variables were free-text descriptions written by the host, and were difficult to sort through and use in our clustering model.

The variable importance output shows that price was by far the most important variable for predicting neighborhood, which is the relationship we wanted to explore from the beginning. Using our map, we think our model is a valuable tool for finding Airbnbs that could be considered a good deal. Since price was used 100% of the time to predict our clusters, we can draw conclusions about the relative price of an Airbnb compared to its neighborhood. Any Airbnb that is actually located in Central Boston but predicted as Suburbs likely has a price much lower than similar Airbnbs in Central Boston. The opposite is also true: incorrectly classified properties in the Suburbs are likely more expensive than their neighbors.
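This "good deal" idea can be sketched directly: filter the tune-set predictions for listings whose actual group is Central_Boston but whose predicted group is Suburbs. The tiny `tune_results` data frame below is a hypothetical stand-in for our real tune set joined with the decision-tree predictions.

```r
library(dplyr)

# Hypothetical stand-in for the tune set with model predictions attached
tune_results <- data.frame(
  name      = c("Cozy loft", "Back Bay studio", "JP room"),
  price     = c(90, 250, 70),
  actual    = c("Central_Boston", "Central_Boston", "Suburbs"),
  predicted = c("Suburbs", "Central_Boston", "Suburbs")
)

# Centrally located listings that the model priced like the suburbs:
# candidates for a good deal
potential_deals <- tune_results %>%
  filter(actual == "Central_Boston", predicted == "Suburbs")

potential_deals
```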